Gathering Alternative Surface Forms for DBpedia Entities
Authors
Abstract
Wikipedia is often used as a source of surface forms, that is, alternative reference strings for an entity, which are required for entity linking, disambiguation and coreference resolution tasks. A number of works have extracted surface forms from Wikipedia labels, redirects, disambiguations and anchor texts of internal Wikipedia links; we complement these with anchor texts of external links to Wikipedia found in the Common Crawl web corpus. We tackle the problem of the quality of Wikipedia-based surface forms, which has not been raised before. We create a gold standard for evaluating dataset quality, which reveals the surprisingly low precision of Wikipedia-based surface forms. We propose filtering approaches that boost precision from 75% to 85% for a random entity subset, and from 45% to more than 65% for a subset of popular entities. The filtered surface form dataset as well as the gold standard are made publicly available.
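To make the evaluation described in the abstract concrete, the following is a minimal sketch of how precision against a gold standard could be measured before and after filtering. It is not the authors' implementation: the pair representation, the `precision` and `apply_filter` helpers, the toy data and the stop-list filter are all assumptions introduced for illustration, and the abstract does not say which filters the paper actually uses.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed representation: one candidate is an (entity URI, surface form) pair.
Pair = Tuple[str, str]


def precision(pairs: Iterable[Pair], gold: Dict[Pair, bool]) -> float:
    """Share of gold-judged candidate pairs that are marked correct."""
    judged = [gold[p] for p in pairs if p in gold]
    if not judged:
        raise ValueError("no candidate pair is covered by the gold standard")
    return sum(judged) / len(judged)


def apply_filter(pairs: Iterable[Pair], keep: Callable[[Pair], bool]) -> List[Pair]:
    """Keep only the candidate pairs accepted by the filter predicate."""
    return [p for p in pairs if keep(p)]


# Toy data, invented for this sketch (not taken from the published dataset).
candidates: List[Pair] = [
    ("dbr:Berlin", "Berlin"),
    ("dbr:Berlin", "German capital"),
    ("dbr:Berlin", "here"),
    ("dbr:Berlin", "click"),
]
gold: Dict[Pair, bool] = {
    ("dbr:Berlin", "Berlin"): True,
    ("dbr:Berlin", "German capital"): True,
    ("dbr:Berlin", "here"): False,
    ("dbr:Berlin", "click"): False,
}

# Hypothetical filter: drop generic anchor texts such as "here" or "click".
GENERIC_ANCHORS = {"here", "click"}


def not_generic(pair: Pair) -> bool:
    return pair[1].lower() not in GENERIC_ANCHORS


filtered = apply_filter(candidates, not_generic)
precision_before = precision(candidates, gold)
precision_after = precision(filtered, gold)
```

Comparing `precision_before` with `precision_after` mirrors the before-and-after figures quoted in the abstract. A real evaluation would use the published gold standard and the paper's own filters rather than this stop list.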
Related resources
NTUNLP Approaches to Recognizing and Disambiguating Entities in Long and Short Text in the 2014 ERD Challenge
This paper presents the NTUNLP systems in the long track and the short track of the Entity Recognition and Disambiguation Challenge 2014. We first create a dictionary that contains the possible surface forms of Freebase Ids, then scan the given text from left to right with the longest match strategy to detect the mentions, and eliminate the unwanted surface forms based on a stop word list. Meth...
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Using Semantic Web Resources for Solving Winograd Schemas: Sculptures, Shelves, Envy, and Success
Winograd Schemas are sentences where a pronoun must be linked to one of two possible entities in the same sentence. Deciding correctly which entity should be linked was proposed as an alternative to the Turing test. Knowledge is a critical component of solving this challenge and Linked Data resources promise to be useful to that end. We discuss two example Winograd Schemas and related knowledge...
Linked hypernyms: Enriching DBpedia with Targeted Hypernym Discovery
The Linked Hypernyms Dataset (LHD) provides entities described by Dutch, English and German Wikipedia articles with types in the DBpedia namespace. The types are extracted from the first sentences of Wikipedia articles using Hearst pattern matching over part-of-speech annotated text and disambiguated to DBpedia concepts. The dataset covers 1.3 million RDF type triples from English Wikipedia, ou...
The Association Rule Mining System for Acquiring Knowledge of DBpedia from Wikipedia Categories
Wikipedia categories are a useful source of knowledge that is usually expressed in a noun-phrase that contains information about concepts of entities or relations among entities. In DBpedia KBs, they categorize their entities into Wikipedia categories using RDF triples. The RDF triples represent only categories of entities, but not concepts of entities or relations among entities despite the fa...